The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK. So today we're going to continue on with some of the design patterns that we started talking about last week.

So to recap, there are really four common steps to taking a program and then parallelizing it. Often you're starting off with a program that's designed or written in a sequential manner. And what you want to do is find tasks in the program -- these are independent pieces of work that you are going to be able to decompose from your sequential code. You're going to group tasks together into threads or processes. And then you'll essentially map each one of these threads or processes down to the actual hardware. And that will get you, eventually when these programs run, the concurrency and the performance speedups that you want.

So as a reminder of what I talked about last week in terms of finding the tasks or finding the concurrency, you start off with an application. You come up with a block-level diagram. And from that you try to understand where the time is spent in the computations and what are some typical patterns for how the computations are carried out. So we talked about task decomposition -- independent tasks, or tasks that might be different, that the application is carrying out. So in the MPEG encoder, we looked at decoding the motion vectors for temporal compression versus spatial compression. That's substantially different work. We talked about data decomposition.
So if you have some work that's really consuming a large chunk of data, and you realize that it's applying the same kind of work to each of those data pieces, then you can partition your data into smaller subsets and apply the same function over and over again. So in the motion compensation phase, that's one example where you can replicate the function and split up the data stream in different ways and have these tasks proceed in parallel. So that's data decomposition. And then we talked a little bit about making a case for a pipeline decomposition. So you have a data assembly line or producer-consumer chains, and you essentially want to recognize those in your computation and make it so that you can exploit them eventually when you're doing your mapping down to actual hardware.

But what does it mean for two tasks to actually be concurrent? And how do you know that you can safely run two tasks in parallel? This is something I went over only crudely last time. So to make it more concrete, I'm highlighting Bernstein's condition, which says that, given two tasks, if the input set of one task does not intersect with the output set of the other, and vice versa, and neither task updates the same data structures in memory, then there are really no dependency issues between them. You can run them safely in parallel. So for tasks T1 and T2: if all the data that's consumed by T1 -- all the data elements that are read by T1 -- is different from the data that's written by T2, then there's really no problem with T1 while T2 is running in parallel, because T2 is updating an orthogonal data set. Similarly, the data read by T2 has to be different from what's written by T1, and the outputs of the two tasks have to be different as well.

So as an example, let's say you have two tasks. In T1 you're doing some basic statements. These could be more coarse grained -- there could be a lot more computation in here -- I've just simplified it for the illustration. So you have task one doing a equals x plus y.
And task two does b equals x plus z. So if we look at the read set for T1, these are all the variables or data structures or addresses that are read by the first task. So that's x and y here. And then there's all the data that's written or produced by T1. Here we're just producing one data value, and that's going into location a. Similarly we can come up with the read set and write set for T2. And that's shown here: task T2 has x and z in its read set, and it produces one data value, b. If we take the intersections of the read and write sets for the different tasks, they're empty. Each task reads something completely different from what's produced by the other task, and they write to two completely different memory locations. So I can essentially run these two tasks in parallel. You can extend this analysis, and compilers can actually use this condition to determine when two tasks can be parallelized if you're doing automatic parallelization. And you'll probably hear more about that later on.

So what I focused on last time were the finding-concurrency patterns. And I had identified four design spaces based on the work that's outlined in the book by Mattson, Sanders, and Massingill. It starts with two large concepts. The first helps you figure out how you're going to actually express your algorithm. So first you find your concurrency and then you organize it in some way. And we're going to talk about that in more detail. And then once you've organized your tasks in some way that actually expresses your overall computation, you need some software construction utilities or data structures or mechanisms for actually orchestrating the computations, for which they have also abstracted out some common patterns. And so I'll briefly talk about these as well.
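To make the read-set/write-set test above concrete, here is a minimal C sketch (not from the lecture) that checks Bernstein's condition for the two example tasks, representing each variable by a small integer ID:

    #include <stdbool.h>
    #include <stdio.h>

    /* Label each variable with a small integer ID: x, y, z, a, b. */
    enum { X, Y, Z, A, B, NVARS };

    /* Read and write sets for T1 (a = x + y) and T2 (b = x + z),
       encoded as membership arrays indexed by variable ID. */
    static const bool read1[NVARS]  = { [X] = true, [Y] = true };
    static const bool write1[NVARS] = { [A] = true };
    static const bool read2[NVARS]  = { [X] = true, [Z] = true };
    static const bool write2[NVARS] = { [B] = true };

    /* True if the two sets share no variable. */
    static bool disjoint(const bool *s, const bool *t) {
        for (int v = 0; v < NVARS; v++)
            if (s[v] && t[v]) return false;
        return true;
    }

    int main(void) {
        /* Bernstein's condition: read1 and write2 don't intersect,
           read2 and write1 don't intersect, write1 and write2 don't intersect. */
        bool safe = disjoint(read1, write2) &&
                    disjoint(read2, write1) &&
                    disjoint(write1, write2);
        printf("T1 and T2 %s run in parallel\n", safe ? "can safely" : "cannot safely");
        return 0;
    }

A compiler doing automatic parallelization applies essentially this test, with the sets derived from the program's loads and stores.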
On the algorithm expression side, these are essentially conceptualization steps that help you abstract out your problem. And you may in fact think about your algorithm expression in different ways to expose different kinds of concurrency, or to be able to explore different ways of mapping the concurrency to hardware. The construction side is more about actual engineering and implementation. So here you're actually thinking about: what do the data structures look like? What is the communication pattern going to look like? Am I going to use things like MPI or OpenMP? What does that help me with in terms of doing my implementation?

So given a collection of concurrent tasks -- you've done your first step in your four design patterns -- what is your next step? And that's really mapping those tasks that you've identified down to some sort of execution units. Threads are very common. This is essentially what we've been using on Cell: we take our computation, we wrap it into SPE threads, and then we can execute those at run time.

So some things to keep in mind -- although you shouldn't over-constrain yourself in terms of these considerations. What is the magnitude of the parallelism that you're going to get? Do you want hundreds or thousands of threads? Or do you want something on the order of tens? And this is because you don't want to overwhelm the intended system that you're going to run on. So as we talked about yesterday on the Cell processor, if you're creating a lot more than six threads, then you can create problems, or you essentially don't get extra parallelism, because each thread runs to completion on each SPE and the context switch overhead is extremely high. So you don't want to spend too much engineering cost to come up with an algorithm implementation that's massively scalable to hundreds or thousands of threads when you can't actually exploit it.
But that doesn't mean that you should over-constrain your implementation to the point where, if I now want to take your code and run it on a different machine, I essentially have to redesign or re-engineer the complete process. So you want to avoid tendencies to over-constrain the implementation. And you want to leave your code in a way that's malleable, so that you can easily make changes to factor in new platforms that you want to run on, or new machine architecture features that you might want to exploit.

So there are three major organization principles I'm going to talk about. And none of these should be foreign to you at this point, because we've talked about them in different ways in the recitations or in previous lectures. It's really: what is it that determines the algorithm structure, based on the set of tasks that you're actually carrying out in your computation? So there's the principle that says, organize things by tasks. I'm going to talk about that. And then there's a principle that says, organize things by how you're doing the data decomposition. So in this case, how you're actually distributing the data, or how the data is laid out in memory, or how you're partitioning the data to compute on it, dictates how you should organize your actual computation. And then there's organize by flow of data. And this is something you'll hear about more in the next lecture, where we're talking about streaming. But in this pattern, if there are specific computations that take advantage of high-bandwidth flow of data between computations, you might want to exploit that for concurrency. And we'll talk about that as well.

OK. So here's a design diagram for how you can actually go through the process. You ask yourself a set of questions. If I want to organize things by tasks, then there are essentially two main clusters, two main patterns.
If the code is recursive, then you essentially want to apply a divide-and-conquer pattern or a divide-and-conquer organization. If it's not recursive, then you essentially want to do task parallelism.

So in task parallelism -- I've listed two examples here, but really any of the things that we've talked about in the past fit. Ray tracing: here you're shooting rays through a scene to try to determine how to render it, and really each ray is a separate and independent computation. In molecular dynamics you're trying to determine the non-bonded force calculations. There are some dependencies, but really you can do the calculation for one molecule or for one atom independent of any other. And then there's the global dependence of having to update or communicate across all those molecules to reflect the new positions in the system.

So the common factors here are that your tasks are associated with iterations of a loop, and you can distribute them so that each processor does a different set of iterations of the loop. And often you know what the tasks are before you actually start your computation. Although in some cases, like in ray tracing, you might generate more and more threads, or more and more computations, as you go along -- because as the ray is shooting off, you're calculating new reflections, and that creates extra work. But largely you have these independent tasks that you can encapsulate in threads and run.

And this might appear subtle, but there are algorithm classes where not all tasks need to complete for you to arrive at a solution. In some cases you might converge to an acceptable solution, and you don't actually need to go through and exercise all the computation that's outstanding for you to say the program is done.
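Before moving on, here is a minimal sketch (illustrative only, not the lab code) of the "tasks are loop iterations" idea above: each pthread takes a contiguous block of independent iterations.

    #include <pthread.h>
    #include <stdio.h>

    #define N        1000
    #define NTHREADS 4

    static double result[N];

    /* Each thread handles one contiguous block of loop iterations;
       the iterations are independent, so no synchronization is needed. */
    static void *worker(void *arg) {
        long id    = (long)arg;
        long chunk = N / NTHREADS;
        long start = id * chunk;
        long end   = (id == NTHREADS - 1) ? N : start + chunk;
        for (long i = start; i < end; i++)
            result[i] = (double)i * i;   /* stand-in for the real per-iteration work */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        printf("result[N-1] = %f\n", result[N - 1]);
        return 0;
    }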
So a tricky issue -- I'll revisit this just briefly later on -- is how do you determine whether your program has actually terminated or completed?

Divide and conquer is really for recursive programs. You can think of a well-known sorting algorithm, merge sort, that classically fits into this kind of picture. You have some really large array of data that you want to sort. You keep subdividing it into smaller and smaller chunks until you can do local reorderings, and then you start merging things back together. So this gives you a way to take a problem and divide it into subproblems: you split the data at some point and then you join it back together -- you merge it. You might see terms like fork and merge or fork and join used instead of split and join. I've used the terminology that melds well with some of the concepts we use in streaming that you'll see in the next lecture.

And in these kinds of programs, it's not always the case that each subproblem will have essentially the same amount of work to do. You might need more dynamic load balancing, because how you distribute the data might lead you to do more work in one subproblem than in another. This is as opposed to some of the other mechanisms where static load balancing will work really well. And to remind you, static load balancing essentially says: you have some work, you assign it to each of the processors, and you're going to be relatively happy with how each processor's utilization is going to be over time -- nobody's going to be too overwhelmed with the amount of work they have to do. In this case, you might end up needing some mechanism for dynamic load balancing that says: I'm unhappy with the work distribution or utilization, some processors are more idle than others, so you might want to essentially redistribute things.
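One way (among several) to express the divide-and-conquer split/join shape in code is with OpenMP tasks. This is a hedged sketch of a recursive sum rather than the full merge sort, purely to show where the fork and join points sit:

    #include <omp.h>
    #include <stdio.h>

    /* Divide and conquer: split the range, solve the halves as parallel
       tasks, then join and merge (here the merge is just an addition). */
    static long sum(const int *a, long lo, long hi) {
        if (hi - lo < 10000) {              /* small subproblem: solve directly */
            long s = 0;
            for (long i = lo; i < hi; i++) s += a[i];
            return s;
        }
        long mid = lo + (hi - lo) / 2, left, right;
        #pragma omp task shared(left)       /* fork: left half as a new task */
        left = sum(a, lo, mid);
        right = sum(a, mid, hi);            /* right half in the current task */
        #pragma omp taskwait                /* join: wait for the child task */
        return left + right;                /* merge */
    }

    int main(void) {
        enum { N = 1000000 };
        static int a[N];
        for (long i = 0; i < N; i++) a[i] = 1;
        long total;
        #pragma omp parallel
        #pragma omp single                  /* one thread starts the recursion */
        total = sum(a, 0, N);
        printf("total = %ld\n", total);
        return 0;
    }

The cutoff on small subproblems is what keeps the number of forked tasks from overwhelming the machine, which connects back to the earlier point about matching the magnitude of parallelism to the hardware.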
So what we'll talk about later is how this concept of the divide-and-conquer parallelization pattern works into the actual implementation -- how do I actually implement a divide-and-conquer organization?

The next organization is organize by data. So here you have some computation -- not sure why it's flickering.

AUDIENCE: Check your -- maybe your VGA cable isn't in well.

PROFESSOR: So in organize by data, you essentially want to apply this if you have a lot of computation that's using a shared global data structure, or that's going to update a central data structure. So in molecular dynamics, for example, you have a huge array that records the position of each of the molecules. And while you can do the force calculations independently, eventually all the parallel tasks have to communicate with the central data structure and say, here are the new locations for all the molecules. And so that has to go into a central repository.

And there are different kinds of decompositions within this organization. If your data structure is recursive -- a linked list or a tree or a graph -- then you can apply the recursive data pattern. If it's not, if it's linear, like an array or a vector, then you apply geometric decomposition. And you've essentially seen geometric decomposition; these were some of the labs that you've already done. So in the example from yesterday's recitation, you're doing an N-body simulation: in terms of who is gravitating towards whom, you're calculating the forces between pairs of objects. And depending on the force that each object feels, you calculate a new motion vector, and you use that to update the position of each body in, say, the galaxy that you're simulating. And so what we talked about yesterday was, given an array of positions, each processor gets a sub-chunk of that position array, and it knows how to do its calculations locally, based on that.
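A rough sketch of that geometric decomposition (the names and numbers are made up, not the lab code): the position array is cut into contiguous blocks and each processor works on its own block.

    #include <stdio.h>

    #define NBODIES 1024
    #define NPROCS  4

    typedef struct { double x, y, z; } Position;

    static Position pos[NBODIES];

    /* Geometric decomposition: processor p owns the contiguous block
       [p*chunk, (p+1)*chunk) of the position array and works on it locally. */
    static void local_work(int p) {
        int chunk = NBODIES / NPROCS;
        int start = p * chunk;
        int end   = (p == NPROCS - 1) ? NBODIES : start + chunk;
        for (int i = start; i < end; i++) {
            pos[i].x += 0.1;   /* stand-in for the real force/position update */
        }
    }

    int main(void) {
        for (int p = 0; p < NPROCS; p++)   /* in the real code these run in parallel */
            local_work(p);
        printf("pos[0].x = %f\n", pos[0].x);
        return 0;
    }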
And then you might also communicate local chunks between processors to do more scalable computations.

Recursive data structures are a little bit more tricky. At face value you might think that there's really no parallelism you can get out of a recursive data structure. If you're iterating over a list and you want to get the sum -- well, I just need to go through the list; can I really parallelize that? There are, however, opportunities where you can reshape the computation in a way that exposes the concurrency. And often what this comes down to is that you're going to do more work, but it's OK because you're going to finish faster. So this work-versus-concurrency tradeoff is what I'm going to illustrate with an example.

So in this application we have some graphs, and for each node in a graph we want to know: what is its root? This works well when you have a forest, where not all the graphs are connected, and given a node you want to know who the root of its graph is. So what we can do is essentially get more concurrency by changing the way we actually think about the algorithm. The straightforward way is to start with each node and, in a directed graph, keep following its successor -- this is essentially order n, because for each node we may have to follow on the order of n links. But we can think about it slightly differently. What if, rather than finding the successor and then finding that successor's successor, at each computational step we start with a node and ask: who is your successor's successor? Then we can converge, in this example, in three steps. So starting from five, we ask: who is the successor's successor of five? And that would be two. And similarly you can do that for seven and so on. And so you keep asking the question. You can distribute these data structures, repeatedly ask these questions at all of the n nodes, and it leads you to an order log n solution versus an order n solution. But what have I done in each step?
Well, I've actually increased the amount of work that I'm doing, by order n at each step. Because at each step I'm essentially doing a query for every one of the n nodes -- who is your successor's successor? -- whereas in the sequential case I just need to follow each link once for each node, and that works really well. So most strategies based on this pattern of decomposing your computation according to a recursive data structure lead you to doing more work, or some increase in the amount of work. But you get it back because you can decrease your execution time. And so this is a tradeoff that you might want to consider.

AUDIENCE: In the first one, order n was the sequential case?

PROFESSOR: Yeah, yeah. It's a typo.

So, organize by flow of data. This is essentially the pipeline model -- and we talked about this again in some of the recitations in terms of SPE-to-SPE communication -- or do you want to organize based on event-based mechanisms? What these really come down to is: how regular is the flow of data in your application? If you have a regular, let's say one-way, flow through a stable computation path -- so I've set up my algorithm structure, data is flowing through it at a regular rate, and the computation graph isn't changing very much -- then I can essentially pipeline things really well. And this could be a linear chain of computation or it could be nonlinear; there could be branches in the graph. And I can use that in a way to exploit pipeline parallelism. If I don't have this nice, regular structure, it could be that events are created at run time. So, for example, you're a car wash attendant and a new car comes in, so you have to find a garage to assign to it and then turn on the car wash machine.
So if dynamic threads are created based on sensory input that comes in, then you might want to use event-based coordination. You have irregular computation -- the computation might vary based on the data that comes into your system -- and you might have unpredictable data flow.

So in the pipeline model, the thing to consider is pipeline throughput versus pipeline latency. The amount of concurrency in a pipeline is really limited by the number of stages. This is nothing new; you've seen this, for example, in superscalar pipelines. And just as in the case of an architectural pipeline, the amount of time it takes you to fill the pipeline and the amount of time it takes you to drain the pipeline can essentially limit your parallelism. So you want those to be really small compared to the actual computation that you spend in your pipeline. And the performance metric is usually the throughput: how much data can you pump through your pipeline per unit time? So in video encoding, it's the frames per second that you can produce. The pipeline latency, though, is important, especially in a real-time application where you need a result every 10 milliseconds. Your pacemaker, for example, has to produce a signal to your heart at specific rates. So you need to consider: what is your pipeline throughput versus your pipeline latency? And that can actually determine how many stages you might want to decompose or organize your application into.

And in event-based coordination, these are interactions of tasks over unpredictable intervals, and you're more prone to deadlocks in these applications. Because you might have cyclic dependencies, where one task can't proceed until it gets data from another task, but that one in turn can't proceed until it gets data from the first. You can create these complex interactions that often lead to deadlock.
So you have to be very careful in structuring things together so you don't end up with feedback loops that block computation progress.

So given these three organizational structures -- I can organize my computation by tasks, by data, or by the flow of data -- what are the supporting structures? How do I actually implement these? There are many different supporting structures. I've identified four that occur most often in the literature and in books and in the common terminology that's used. And those are SPMD, loop parallelism, the master/worker pattern, and the fork/join pattern.

In the SPMD pattern, you're talking about a single program, multiple data concept. So here you just have one program. You write it once and then you assign it to each of your processors to run. So it's the same program; it just runs on different machines. Now each program, or each instance of the code, can take different control flow. So just because they're running the same program doesn't mean the computation is happening in lock step -- that would be more of a SIMD or vector-like computation. In this model you can actually take independent control flow; there could be different behavior in each instance of the code. But you're running the same code everywhere. So this is slightly different, for example, from what you've seen on Cell, where you have the PPE thread that creates SPE threads. Sometimes the SPE threads are all the same, but it's not always the case that the PPE thread and the SPE threads are the same.

So in the SPMD model there are really five steps. You initialize your computation and the collection of code instances that you're going to run. For each one you obtain a unique identifier -- this usually helps the instances determine who needs to communicate with whom, or any ordering dependencies. And you run the same program on each processor.
And what you also need to do in this case is distribute your data between each of the different instances of your code. And once each program is running and computing on its data, eventually you need to finalize in some way. And so that might mean doing a reduction to communicate all the data to one processor to actually output the value.

So we saw an SPMD example in the numerical integration for calculating pi. If you remember, we had this very simple C loop, and we showed the MPI implementation of that C loop. In this code, what we're doing is dividing the integration range into intervals, and for each interval we're going to calculate a value. In the MPI program we're essentially deciding how big an interval each process should run. So it's the same program -- it runs on every single machine, or every single processor -- and each processor determines, based on its ID, which intervals of the actual integration to do.

And so in this model we're distributing work relatively evenly. Each processor does a specific set of iterations determined by its ID: if I have 10 processors and I'm doing 100 steps, then a processor might do steps i, i plus 10, i plus 20, and so on. But I can do different distributions. The first is a block distribution: each processor gets one entire contiguous slice, so it starts at its part of the interval and goes through to completion. In a cyclic distribution, I instead take smaller slices of each one of those intervals and hand them out in a round-robin fashion or by some other mechanism. And I've greyed out the components for the block distribution to show you the contrast here.

There are some challenges in the SPMD model.
And that is: how do you actually split your data correctly? You have to distribute your data in a way that doesn't increase contention on your memory system, so that each processor that's assigned some of the computation has data locally to actually operate on. And you want to achieve an even work distribution: do you need a dynamic load balancing scheme, or can you use an alternative pattern if that's not suitable?

The second pattern, as opposed to the SPMD pattern, is the loop parallelism pattern. This is best suited when you have a program that you can't really change a whole lot, or that you don't really want to change a whole lot, and you have a programming model that allows you to identify the loops that take up most of the computation and then insert annotations or some way to automatically parallelize those loops. So we saw in the OpenMP example that you have some loops and you can insert these pragmas that say, this loop is parallel. And the compiler and the run-time system can automatically partition the loop into smaller chunks, and then each chunk can compute in parallel. And you might apply this scheme in different ways depending on how well you understand your code: are you running on a shared memory machine, can you not afford to do a whole lot of restructuring, might communication costs be really expensive?

The master/worker pattern is really starting to get closer to what we've done with the Cell recitations and the Cell labs. You have some pool of independent tasks, and a master that essentially runs and distributes each of these tasks to different processors, the workers. So in this case you get several advantages that you can leverage. If your tasks are varied in nature -- they might finish at different times or they might require different kinds of resources -- you can use this model to view your machine as a non-symmetric processor.
Not everybody is the same, and you can use this model really well for that. So you can distribute these tasks and then you can do dynamic load balancing, because as workers finish you can ship them more and more work.

So it has some particularly relevant properties for heterogeneous computations, but it's also really good when you have a whole lot of parallelism in your application -- so-called embarrassingly parallel problems. Ray tracing, molecular dynamics, a lot of scientific applications have these massive levels of parallelism. And you can use this essentially work-queue based mechanism that says: I have all these tasks and I'll just dispatch them to workers and compute.

And as I pointed out earlier, when do you define your entire computation to have completed? Sometimes you keep computing until you've reached an acceptable result, and often you're willing to accept a result within some range of error. And you might have some more threads that are still in flight -- do you terminate your computation then or not? What are some issues with synchronization? If you have so many threads running together, does the communication between them to send out these control messages -- I'm done -- start to overwhelm you?

The fork/join pattern is really not conceptually too different in my mind from the master/worker model, and it's also very relevant to what we've done with Cell. The main difference might be that you have tasks that are dynamically created. So in the embarrassingly parallel case, you actually know the world of all your potential tasks that you're going to run in parallel. In the fork/join model, some new computation might come up as a result of, say, an event-based mechanism. So a task might be created dynamically and then later terminated, or it might complete, and new ones come up as a result.
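The work-queue mechanism behind master/worker can be sketched in a few lines of C with pthreads (illustrative only, not the lab code): workers pull tasks from a shared queue as they finish, which is what gives you the dynamic load balancing.

    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS   64
    #define NWORKERS 4

    /* A shared work queue: the master fills it, workers pull from it.
       next_task is protected by a mutex so each task is handed out once. */
    static int next_task = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static double out[NTASKS];

    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            int t = next_task < NTASKS ? next_task++ : -1;
            pthread_mutex_unlock(&lock);
            if (t < 0) break;              /* queue empty: this worker is done */
            out[t] = t * 2.0;              /* stand-in for the real task body */
        }
        return NULL;
    }

    int main(void) {
        pthread_t w[NWORKERS];
        for (int i = 0; i < NWORKERS; i++)  /* master forks the workers */
            pthread_create(&w[i], NULL, worker, NULL);
        for (int i = 0; i < NWORKERS; i++)  /* and waits for them all to finish */
            pthread_join(w[i], NULL);
        printf("out[NTASKS-1] = %f\n", out[NTASKS - 1]);
        return 0;
    }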
AUDIENCE: It almost seems like in the master/worker model you fork the worker tasks once and then keep assigning tasks to them, whereas in the fork/join model you just keep forking, which might not be completely matched to the number of processors available [inaudible] fork them out.

PROFESSOR: So the process that's creating all these threads, or that's doing all the forking, is often known as the parent, and the tasks that are generated are the children. And eventually the parent can't continue, or can't resume, until its children have completed or have reached the join point.

And so those are really some of the models that we've seen already, in a lot of cases in the recitations and labs, for how you run your computations. And some of you have already discovered these and are actually thinking about how your projects should be parallelized for your actual Cell demos.

Some of the other things that I'm going to talk about are communication patterns. So two lectures ago you saw, for example, that you have point-to-point communication or you have broadcast communication. In point-to-point communication, you have two tasks that need to communicate, and they can send explicit messages to each other. These could be control messages that say, I'm done, or, I'm waiting for data. Or they could be data messages that actually ship you a particular data element that you might need. And again we've seen this with Cell.
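In MPI, that kind of point-to-point communication is expressed with matched send and receive calls. A minimal sketch (not from the lecture; run with at least two ranks, e.g. mpirun -np 2):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double value = 3.14;
        if (rank == 0) {
            /* Point-to-point: rank 0 sends one double to rank 1. */
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %f\n", value);
        }

        MPI_Finalize();
        return 0;
    }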
Broadcast says: I have some result that everybody needs, and so I send that out to everybody by some mechanism. There is no real broadcast mechanism on Cell. The concept I'm going to talk about, though, is the reduction mechanism, which is really the inverse of the broadcast. In the broadcast I have a data element I need to send to everybody else. In the reduction, all of you have data that I need, or all of us have data that somebody else needs. So what we need to do is collectively bring that data together, or group it together, and generate an end result.

A simple example of a reduction: you have some array of elements that you want to add together, and the result of the collective operation is the final sum. So you have an array of four elements: A0, A1, A2, and A3. And you can do a serial reduction. I take A0 and add it to A1, and that gives me a result. Then I take A2 and add that to it, and I take A3 and add that to it. And so at the end I'll have calculated the sum from A0 to A3. The serial reduction is what you're stuck with when your operation is not associative. Addition is associative, so in this case I can actually do something more intelligent -- I think we talked about that last time, and I'm going to show you some more examples. And often the reduction is followed by a broadcast: here is the end result, who are all the people that need it? I'll broadcast that out so that everybody has the result. If your operation isn't associative, then you're essentially limited to a serial process, and that's not very good from a performance standpoint.

One of the tricks you can apply for actually getting performance out of your reduction is to go to a tree-based reduction model. This might be very obvious: rather than doing A0 plus A1 and then adding A2 to that result, I can do A0 plus A1, and in parallel I can do A2 plus A3, and then I can take those results and add them together. So rather than doing n steps I do log n steps. This is particularly attractive when only one task needs the result. So in the MPI program where we're doing the integration to calculate pi, one processor needs to print out that value of pi.
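Here is a serial sketch of that tree-based reduction shape, assuming an associative operation like addition; the pairwise combinations within each level are the parts that could run in parallel:

    #include <stdio.h>

    /* Tree-based reduction of an associative operation (+) over n values:
       at each level, elements a distance `stride` apart are combined in
       pairs, so n values are reduced in about log2(n) levels instead of n-1
       serial steps. The pairs within one level are independent. */
    static double tree_sum(double *a, int n) {
        for (int stride = 1; stride < n; stride *= 2)
            for (int i = 0; i + stride < n; i += 2 * stride)
                a[i] += a[i + stride];
        return a[0];
    }

    int main(void) {
        double a[4] = { 1.0, 2.0, 3.0, 4.0 };   /* A0..A3 from the example */
        printf("sum = %f\n", tree_sum(a, 4));   /* 10.0, in two levels */
        return 0;
    }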
756 00:34:23,170 --> 00:34:25,440 757 00:34:25,440 --> 00:34:29,020 But if you have a computation where more than one process 758 00:34:29,020 --> 00:34:31,170 actually needs the result of the reduction, there's 759 00:34:31,170 --> 00:34:35,170 actually a better mechanism you can use that's sort of a 760 00:34:35,170 --> 00:34:38,060 better alternative to the tree-based reduction followed 761 00:34:38,060 --> 00:34:39,950 by a broadcast. So you can do a 762 00:34:39,950 --> 00:34:44,010 recursive doubling reduction. 763 00:34:44,010 --> 00:34:47,110 So at the end here, every process will have the result 764 00:34:47,110 --> 00:34:50,550 of the reduction without having done the broadcast. So 765 00:34:50,550 --> 00:34:54,950 we can start off as with the tree-based and add up A0 and 766 00:34:54,950 --> 00:34:56,380 A1 together. 767 00:34:56,380 --> 00:35:00,110 But what we do is for each process that has a value, we 768 00:35:00,110 --> 00:35:01,960 sort of do a local exchange. 769 00:35:01,960 --> 00:35:05,100 So from here we communicate the value to here. 770 00:35:05,100 --> 00:35:06,900 And from here we communicate the value to here. 771 00:35:06,900 --> 00:35:09,900 And so now these two processors that had the value 772 00:35:09,900 --> 00:35:13,700 independently now both have a local sum, A0 to A1. 773 00:35:13,700 --> 00:35:17,140 And similarly we can sort of make the similar symmetric 774 00:35:17,140 --> 00:35:19,740 computation on the other side. 775 00:35:19,740 --> 00:35:23,340 And now we can communicate data from these two processors 776 00:35:23,340 --> 00:35:25,610 here to come up with the end -- 777 00:35:25,610 --> 00:35:30,150 778 00:35:30,150 --> 00:35:31,630 PROFESSOR: It was there. 779 00:35:31,630 --> 00:35:32,720 All right. 780 00:35:32,720 --> 00:35:33,420 Must have been lost in the animation. 781 00:35:33,420 --> 00:35:37,000 So you actually do that the other way as well so that you 782 00:35:37,000 --> 00:35:42,050 have the sum A0 to A3 on all the different processors. 783 00:35:42,050 --> 00:35:45,350 Sorry about the lost animation. 784 00:35:45,350 --> 00:35:45,620 OK. 785 00:35:45,620 --> 00:35:47,870 So this is better than the tree-based approach with a 786 00:35:47,870 --> 00:35:51,070 broadcast because you end up with local 787 00:35:51,070 --> 00:35:53,410 results of your reduction. 788 00:35:53,410 --> 00:35:58,780 And rather than doing the broadcast following the 789 00:35:58,780 --> 00:36:01,900 tree-based reduction, which takes 2 log n steps overall, you end up with 790 00:36:01,900 --> 00:36:03,000 order log n. 791 00:36:03,000 --> 00:36:06,450 Everybody has the result in order log n versus an order 2 log n 792 00:36:06,450 --> 00:36:10,590 process for the tree-based plus broadcast. 793 00:36:10,590 --> 00:36:13,200 AUDIENCE: On the Cell processor but not in general. 794 00:36:13,200 --> 00:36:15,250 PROFESSOR: Not in general. 795 00:36:15,250 --> 00:36:17,900 It depends on sort of the architectural mechanism that 796 00:36:17,900 --> 00:36:20,860 you have for your network. 797 00:36:20,860 --> 00:36:23,220 If you actually do have, you know, 798 00:36:23,220 --> 00:36:26,205 a broadcast mechanism, say a bus-based architecture where 799 00:36:26,205 --> 00:36:28,970 you can deposit a local value and everybody can pull that value, 800 00:36:28,970 --> 00:36:31,410 then, yeah, it can be more efficient. 801 00:36:31,410 --> 00:36:33,960 Or on optical networks, you can broadcast the data and 802 00:36:33,960 --> 00:36:35,480 everybody can just pick it up.
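For reference, this everyone-gets-the-result collective is what MPI exposes as MPI_Allreduce, and recursive doubling is one of the standard ways libraries implement it. A rough sketch (the local value here is just each rank's ID, standing in for a real partial result):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double local, global;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = (double) rank;   /* stand-in for this process's partial result */

    /* every rank ends up holding the sum; no separate broadcast needed */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d sees total %f\n", rank, global);
    MPI_Finalize();
    return 0;
}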
803 00:36:35,480 --> 00:36:38,910 804 00:36:38,910 --> 00:36:40,340 OK. 805 00:36:40,340 --> 00:36:44,860 So summarizing all the different patterns, so here 806 00:36:44,860 --> 00:36:47,490 these are the actual mechanisms that you would use 807 00:36:47,490 --> 00:36:50,800 for how you would implement the different patterns. 808 00:36:50,800 --> 00:36:53,130 So in the SPMD you would write the same program. 809 00:36:53,130 --> 00:36:56,700 In loop parallelism you have your program and you might 810 00:36:56,700 --> 00:36:59,380 annotate it with sort of some pragmas that say how to 811 00:36:59,380 --> 00:37:01,200 parallelize your computation. 812 00:37:01,200 --> 00:37:04,740 In the master/worker model you might have sort of a master 813 00:37:04,740 --> 00:37:07,690 that's going to create threads and you actually know -- 814 00:37:07,690 --> 00:37:10,800 you might sort of have a very good idea of what is the kind 815 00:37:10,800 --> 00:37:12,780 of work you're going to have to do in each thread. 816 00:37:12,780 --> 00:37:16,340 In the fork/join model you have more dynamism. 817 00:37:16,340 --> 00:37:19,280 So you might create threads on the fly. 818 00:37:19,280 --> 00:37:25,890 And you apply these sort of based on appeal or what is 819 00:37:25,890 --> 00:37:28,490 more suited in terms of implementation to each of the 820 00:37:28,490 --> 00:37:31,190 different patterns for how you actually organize your data. 821 00:37:31,190 --> 00:37:34,020 So in the task parallelism model, this is where you have 822 00:37:34,020 --> 00:37:37,410 a pool of threads that you know you're going to create 823 00:37:37,410 --> 00:37:39,290 or that you're going to use for your computation. 824 00:37:39,290 --> 00:37:43,610 And really you can use largely any one of these models. 825 00:37:43,610 --> 00:37:45,340 So I used a ranking system where four 826 00:37:45,340 --> 00:37:46,340 stars is really good. 827 00:37:46,340 --> 00:37:51,192 One star is sort of bad, and no star means not well suited. 828 00:37:51,192 --> 00:37:53,320 AUDIENCE: Sort of, in Cell, because of the 829 00:37:53,320 --> 00:37:55,220 inherent master there, 830 00:37:55,220 --> 00:37:58,110 sometimes master/worker might get a little bit of a bias 831 00:37:58,110 --> 00:37:59,910 over this one. 832 00:37:59,910 --> 00:38:01,210 PROFESSOR: Right, so -- 833 00:38:01,210 --> 00:38:04,600 AUDIENCE: You don't have to pay the cost of having a master. 834 00:38:04,600 --> 00:38:04,950 PROFESSOR: Right. 835 00:38:04,950 --> 00:38:06,050 Right. 836 00:38:06,050 --> 00:38:10,230 Although you could use the Cell master to do regular 837 00:38:10,230 --> 00:38:11,660 computations as well. 838 00:38:11,660 --> 00:38:14,390 But, yes. 839 00:38:14,390 --> 00:38:18,700 So and the divide and conquer model, you know, might be 840 00:38:18,700 --> 00:38:20,930 especially well suited for a fork and join because you're 841 00:38:20,930 --> 00:38:23,580 creating all these recursive subproblems. They might be 842 00:38:23,580 --> 00:38:24,080 heterogeneous. 843 00:38:24,080 --> 00:38:26,160 Given the nature of the computation that you do, you 844 00:38:26,160 --> 00:38:28,850 might have more subproblems created dynamically. 845 00:38:28,850 --> 00:38:30,300 Fork/join really works well for that. 846 00:38:30,300 --> 00:38:33,220 And in fact, you know, the subproblem structure that I 847 00:38:33,220 --> 00:38:35,400 showed, the graph of sort of division 848 00:38:35,400 --> 00:38:40,260 and then merging, works really well with the fork/join model.
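Going back to the loop-parallelism row for a second, the pragma-style annotation mentioned above looks roughly like this in OpenMP C (the loop, array, and function name here are hypothetical, and it assumes a compiler with OpenMP support):

#include <omp.h>

void scale(float *a, int n, float s) {
    int i;
    /* the pragma asks the compiler/runtime to split the iterations
       of this loop across the available threads */
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        a[i] = a[i] * s;
    }
}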
849 00:38:40,260 --> 00:38:45,260 In the recursive, in the geometric decomposition -- 850 00:38:45,260 --> 00:38:48,420 this is essentially your lab one exercise and the things we 851 00:38:48,420 --> 00:38:50,330 went over yesterday in the recitation. 852 00:38:50,330 --> 00:38:54,860 You're taking data and you're partitioning over multiple 853 00:38:54,860 --> 00:38:56,960 processors to actually compute in parallel. 854 00:38:56,960 --> 00:39:00,510 So this could be SPMD implementation or it could be 855 00:39:00,510 --> 00:39:02,570 a loop parallelism implementation, 856 00:39:02,570 --> 00:39:04,880 which we didn't do. 857 00:39:04,880 --> 00:39:07,280 Less suitable, the master/worker and fork/join, 858 00:39:07,280 --> 00:39:10,960 often because the geometric decomposition applied some 859 00:39:10,960 --> 00:39:13,510 distribution to the data which has static properties that you 860 00:39:13,510 --> 00:39:15,640 can exploit in various ways. 861 00:39:15,640 --> 00:39:17,420 So you don't need to pay the overhead of 862 00:39:17,420 --> 00:39:21,320 master/worker or fork/join. 863 00:39:21,320 --> 00:39:25,670 Recursive data structures sort of have very specific models 864 00:39:25,670 --> 00:39:27,320 that you can run with. 865 00:39:27,320 --> 00:39:32,950 Largely master/worker is a decent implementation choice. 866 00:39:32,950 --> 00:39:36,180 SPMD is another. 867 00:39:36,180 --> 00:39:38,220 And you're going to hear more about sort of the pipeline 868 00:39:38,220 --> 00:39:40,620 mechanism in the next talk so I'm not going to talk about 869 00:39:40,620 --> 00:39:42,120 that very much. 870 00:39:42,120 --> 00:39:44,710 Event-based coordination, largely dynamic. 871 00:39:44,710 --> 00:39:47,090 So fork/join works really well. 872 00:39:47,090 --> 00:39:48,780 So one -- 873 00:39:48,780 --> 00:39:50,870 AUDIENCE: When you're buffering them you could do 874 00:39:50,870 --> 00:39:52,570 master/worker with pipelining? 875 00:39:52,570 --> 00:39:54,960 PROFESSOR: Yes, so next slide. 876 00:39:54,960 --> 00:39:58,860 So sort of these choices or these tradeoffs aren't really 877 00:39:58,860 --> 00:39:59,570 orthogonal. 878 00:39:59,570 --> 00:40:01,670 You can actually combine them in different ways. 879 00:40:01,670 --> 00:40:05,430 And in a lot of applications what you might find is that 880 00:40:05,430 --> 00:40:08,710 the different patterns compose hierarchically. 881 00:40:08,710 --> 00:40:12,540 And you actually want that in various ways -- 882 00:40:12,540 --> 00:40:13,610 for various reasons. 883 00:40:13,610 --> 00:40:17,530 So in the MPEG example, you know, we had tasks here within 884 00:40:17,530 --> 00:40:21,720 each task and identified some pipeline stages. 885 00:40:21,720 --> 00:40:23,680 You know, here I have some data parallelism so I can 886 00:40:23,680 --> 00:40:27,590 apply the loop pattern here. 887 00:40:27,590 --> 00:40:30,400 And what I want to do is actually in my computation 888 00:40:30,400 --> 00:40:32,870 sort of express these different mechanisms so I can 889 00:40:32,870 --> 00:40:34,910 understand sort of different tradeoffs. 890 00:40:34,910 --> 00:40:37,670 And for really large applications, there might be 891 00:40:37,670 --> 00:40:40,630 different patterns that are well suited for the actual 892 00:40:40,630 --> 00:40:41,910 computation that I'm doing. 
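Back on the geometric decomposition row, a rough SPMD-style sketch of that data partitioning, where every process runs the same code and just computes which slice of the data is its own (the helper name and the even-split policy are made up for illustration; rank and nprocs would come from something like MPI_Comm_rank and MPI_Comm_size):

/* every process runs this same code on its own slice */
void my_slice(int rank, int nprocs, int n, int *start, int *end) {
    int chunk = (n + nprocs - 1) / nprocs;               /* ceiling division   */
    *start = rank * chunk;                               /* first index I own  */
    *end   = (*start + chunk < n) ? *start + chunk : n;  /* one past my last   */
}

Each process then loops only over its own [start, end) range, which is the partitioning idea behind the lab one exercise mentioned above.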
893 00:40:41,910 --> 00:40:45,860 So I can combine things like pipelining with a task-based 894 00:40:45,860 --> 00:40:49,270 mechanism or data parallelism to actually get really good 895 00:40:49,270 --> 00:40:50,660 performance speedups. 896 00:40:50,660 --> 00:40:53,850 And one of the things that might strike you as well, 897 00:40:53,850 --> 00:40:55,740 heck, this is a whole lot of work that I have to do to 898 00:40:55,740 --> 00:40:58,810 actually get my code in the right shape so that I can 899 00:40:58,810 --> 00:41:01,050 actually take advantage of my parallel architecture. 900 00:41:01,050 --> 00:41:03,020 You know, I have to conceptually think about the 901 00:41:03,020 --> 00:41:04,250 question the right way. 902 00:41:04,250 --> 00:41:07,270 I have to maybe restructure my computation in different ways 903 00:41:07,270 --> 00:41:09,790 to actually exploit parallelism. 904 00:41:09,790 --> 00:41:11,260 Data distribution is really hard. 905 00:41:11,260 --> 00:41:12,810 I have to get that right. 906 00:41:12,810 --> 00:41:14,970 Synchronization issues might be a problem. 907 00:41:14,970 --> 00:41:16,330 And how much buffering do I need to do 908 00:41:16,330 --> 00:41:18,020 between different tasks? 909 00:41:18,020 --> 00:41:20,210 So the thing you're going to hear about in the next talk 910 00:41:20,210 --> 00:41:22,820 is, well, what if these things really fall out naturally from 911 00:41:22,820 --> 00:41:27,800 the way you actually write the program, and if the way you 912 00:41:27,800 --> 00:41:30,360 actually write your program matches really well with the 913 00:41:30,360 --> 00:41:32,530 intuitive, sort of natural 914 00:41:32,530 --> 00:41:35,000 conceptualization of the problem. 915 00:41:35,000 --> 00:41:37,410 And so I'll leave Bill to talk about that. 916 00:41:37,410 --> 00:41:40,190 And I'm going to stop here. 917 00:41:40,190 --> 00:41:41,883 Any questions? 918 00:41:41,883 --> 00:41:43,099 AUDIENCE: We can take in some questions and 919 00:41:43,099 --> 00:41:44,349 then everybody -- 920 00:41:44,349 --> 00:41:50,650 921 00:41:50,650 --> 00:41:52,460 AUDIENCE: You talked about fork and join. 922 00:41:52,460 --> 00:41:57,735 When you have a parent thread that spawns off a child 923 00:41:57,735 --> 00:42:01,403 thread, how do you keep your parent thread from 924 00:42:01,403 --> 00:42:03,970 using up the SPE? 925 00:42:03,970 --> 00:42:07,290 PROFESSOR: So you have a fork/join where you have -- 926 00:42:07,290 --> 00:42:11,780 AUDIENCE: For most of these the parent might be the PPE. 927 00:42:11,780 --> 00:42:17,430 And so if you just do fork/join, you might not really 928 00:42:17,430 --> 00:42:20,220 use the PPE unless, you know, you have some time and 929 00:42:20,220 --> 00:42:22,670 you let it do some of the tasks and come back. 930 00:42:22,670 --> 00:42:26,859 AUDIENCE: So for our purposes we shouldn't spawn off new 931 00:42:26,859 --> 00:42:29,110 threads from the SPEs? 932 00:42:29,110 --> 00:42:29,810 PROFESSOR: So, yeah. 933 00:42:29,810 --> 00:42:32,390 So most of the threads that are spawned off 934 00:42:32,390 --> 00:42:34,160 are done by the PPE. 935 00:42:34,160 --> 00:42:35,560 So you have these -- 936 00:42:35,560 --> 00:42:39,200 in fact there was a good walkthrough in recitation yesterday. 937 00:42:39,200 --> 00:42:39,950 You have the PPE. 938 00:42:39,950 --> 00:42:43,065 Essentially it sends messages to the SPEs that say, create 939 00:42:43,065 --> 00:42:45,190 these threads and start running them.
940 00:42:45,190 --> 00:42:46,400 Here's the data for them. 941 00:42:46,400 --> 00:42:48,950 And then these threads run on the SPEs. 942 00:42:48,950 --> 00:42:50,680 And they just do local computation. 943 00:42:50,680 --> 00:42:53,570 And then they send messages back to the PPE that say, 944 00:42:53,570 --> 00:42:54,050 we're done. 945 00:42:54,050 --> 00:42:56,422 So that essentially implements the join mechanism. 946 00:42:56,422 --> 00:42:58,390 AUDIENCE: On the other hand, if you are doing something 947 00:42:58,390 --> 00:43:05,430 like the master/slave way, then the SPE can send a 948 00:43:05,430 --> 00:43:09,690 message and deliver another job to the PPE, 949 00:43:09,690 --> 00:43:11,830 which is the master. 950 00:43:11,830 --> 00:43:13,930 If an SPE sees there's some more computation, it can say, OK, 951 00:43:13,930 --> 00:43:16,360 look, put this into your queue, and keep sending 952 00:43:16,360 --> 00:43:19,740 messages, and the master can look at that and update it. 953 00:43:19,740 --> 00:43:22,935 So, you know, it's not only the master who has to fork off; 954 00:43:22,935 --> 00:43:24,670 the slaves also 955 00:43:24,670 --> 00:43:28,470 can still send information back. 956 00:43:28,470 --> 00:43:34,180 So you can think about structuring something 957 00:43:34,180 --> 00:43:36,250 like that. 958 00:43:36,250 --> 00:43:43,340 There are eight -- like if six SPEs are running and you first 959 00:43:43,340 --> 00:43:47,440 get something in there, an SPE that wants to divide will take one 960 00:43:47,440 --> 00:43:50,066 task and run that, and hand the other one to the master: finish 961 00:43:50,066 --> 00:43:53,070 this, and here's my ID. 962 00:43:53,070 --> 00:43:55,560 Send me the message when it's done. 963 00:43:55,560 --> 00:43:56,510 And so you fork that and wait. 964 00:43:56,510 --> 00:43:59,740 So you can assume you can do something like that. 965 00:43:59,740 --> 00:44:05,210 So it's almost master/slave, but the coordination is there. 966 00:44:05,210 --> 00:44:11,215 The trouble with fork/join normally is if you create too 967 00:44:11,215 --> 00:44:12,420 many threads. 968 00:44:12,420 --> 00:44:14,190 You are in like a thread hell because there are too many 969 00:44:14,190 --> 00:44:15,640 things to run. 970 00:44:15,640 --> 00:44:17,760 I don't know, can you swap on the SPE? 971 00:44:17,760 --> 00:44:19,650 PROFESSOR: No. 972 00:44:19,650 --> 00:44:22,035 AUDIENCE: So you can't even do that because of some physical 973 00:44:22,035 --> 00:44:22,300 limitation. 974 00:44:22,300 --> 00:44:28,410 You can't just take up 1,000 threads; you'd have to run another 975 00:44:28,410 --> 00:44:34,180 master/slave thing yourself to schedule 1,000 threads 976 00:44:34,180 --> 00:44:36,900 on top of your SPEs. 977 00:44:36,900 --> 00:44:39,820 And those are going to be blocked threads. 978 00:44:39,820 --> 00:44:40,310 PROFESSOR: Yeah. 979 00:44:40,310 --> 00:44:43,420 Context switching on the SPEs is very expensive. 980 00:44:43,420 --> 00:44:47,980 So on the PlayStation 3 you have six SPEs 981 00:44:47,980 --> 00:44:48,810 available to you. 982 00:44:48,810 --> 00:44:52,020 So if you have a lot more than six threads that you've 983 00:44:52,020 --> 00:44:55,190 created, essentially each one runs to completion. 984 00:44:55,190 --> 00:44:58,210 And then you swap that out and you bring in -- well, that 985 00:44:58,210 --> 00:44:59,090 terminates. 986 00:44:59,090 --> 00:45:02,200 You deallocate it from the SPE and you bring in a new thread.
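The fork/join shape being described, written with plain pthreads purely as an analogy for the PPE-spawns-threads, threads-report-done flow (six workers to mirror the six usable SPEs; the worker body is a placeholder):

#include <pthread.h>
#include <stddef.h>

#define NWORKERS 6

static void *worker(void *arg) {
    (void) arg;
    /* local computation would go here, like an SPE working on its chunk */
    return NULL;   /* returning plays the role of the "we're done" message */
}

int main(void) {
    pthread_t tid[NWORKERS];
    int i;

    /* fork: the parent spawns the children */
    for (i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);

    /* join: the parent can't resume until every child has finished */
    for (i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);

    return 0;
}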
987 00:45:02,200 --> 00:45:05,770 If you actually want to do more thread-like dynamic load 988 00:45:05,770 --> 00:45:08,720 balancing on the SPEs, it's not well suited for that. 989 00:45:08,720 --> 00:45:09,720 Just because the -- 990 00:45:09,720 --> 00:45:13,466 AUDIENCE: The best model there is master/slave. Because the 991 00:45:13,466 --> 00:45:14,990 PPE [UNINTELLIGIBLE PHRASE] 992 00:45:14,990 --> 00:45:15,370 the master part. 993 00:45:15,370 --> 00:45:18,080 It will run more sequential code. 994 00:45:18,080 --> 00:45:22,670 And when there's parallel work, it will hand it out, using 995 00:45:22,670 --> 00:45:25,820 the work queue model type, and send stuff into the SPEs 996 00:45:25,820 --> 00:45:27,290 and feed them. 997 00:45:27,290 --> 00:45:30,750 So work queue type models can be used there. 998 00:45:30,750 --> 00:45:31,140 PROFESSOR: Yeah. 999 00:45:31,140 --> 00:45:33,820 And the SPMD model might not work really well because you 1000 00:45:33,820 --> 00:45:37,030 have this heterogeneity in the actual hardware, right. 1001 00:45:37,030 --> 00:45:39,530 So if I'm taking the same program running on the SPE 1002 00:45:39,530 --> 00:45:43,360 versus the PPE, that code might not be -- so I 1003 00:45:43,360 --> 00:45:45,010 essentially have to specialize the code. 1004 00:45:45,010 --> 00:45:47,670 And that starts to deviate away from the SPMD model. 1005 00:45:47,670 --> 00:45:51,565 AUDIENCE: Something I think -- most of the code you write for 1006 00:45:51,565 --> 00:45:55,020 Cell will probably be master/worker. 1007 00:45:55,020 --> 00:45:56,995 And if you try to do something other than that, you should think 1008 00:45:56,995 --> 00:45:59,556 hard about why that's the case. 1009 00:45:59,556 --> 00:46:02,690 1010 00:46:02,690 --> 00:46:04,140 PROFESSOR: You can do fork/join but 1011 00:46:04,140 --> 00:46:05,120 you know, it's -- 1012 00:46:05,120 --> 00:46:07,490 AUDIENCE: I mean you can't -- 1013 00:46:07,490 --> 00:46:09,410 because you don't have virtualization. 1014 00:46:09,410 --> 00:46:11,938 If you fork too much, where are you going to put those threads? 1015 00:46:11,938 --> 00:46:12,404 PROFESSOR: Right. 1016 00:46:12,404 --> 00:46:14,790 Sometimes you fork -- 1017 00:46:14,790 --> 00:46:15,685 AUDIENCE: Yeah, but in that sense, should 1018 00:46:15,685 --> 00:46:17,220 you fork too much? 1019 00:46:17,220 --> 00:46:19,330 To keep the work, you want the master. 1020 00:46:19,330 --> 00:46:20,590 You can fork things -- 1021 00:46:20,590 --> 00:46:22,770 you can do a virtual fork and send the work to the master 1022 00:46:22,770 --> 00:46:24,070 and say, here, I forked something. 1023 00:46:24,070 --> 00:46:26,350 Here's the work. 1024 00:46:26,350 --> 00:46:30,370 I mean, the key thing is do the simplest thing. 1025 00:46:30,370 --> 00:46:33,320 I mean, you guys have two weeks left. 1026 00:46:33,320 --> 00:46:36,590 And if you try doing anything complicated, you might end up 1027 00:46:36,590 --> 00:46:38,730 with a big mess that's undebuggable. 1028 00:46:38,730 --> 00:46:39,780 Just do simple things. 1029 00:46:39,780 --> 00:46:45,400 And I can vouch, parallelism is hard. 1030 00:46:45,400 --> 00:46:47,600 Debugging parallel code is even harder. 1031 00:46:47,600 --> 00:46:51,300 So don't try to push the limits on 1032 00:46:51,300 --> 00:46:55,345 complexity, with messages going all over the place and 1033 00:46:55,345 --> 00:46:57,615 three different types of parallelism all trying to 1034 00:46:57,615 --> 00:46:59,040 compete in there.
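A very small sketch of that work queue idea, again in plain pthreads rather than anything Cell-specific (fixed-size queue of task IDs, no shutdown or full-queue handling, just the shape of a master pushing work and workers pulling it):

#include <pthread.h>

#define QSIZE 64

static int queue[QSIZE];
static int head = 0, tail = 0, count = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

/* called by the master (the PPE-side role) to hand out a task */
void push_task(int task) {
    pthread_mutex_lock(&lock);
    queue[tail] = task;
    tail = (tail + 1) % QSIZE;
    count++;
    pthread_cond_signal(&nonempty);   /* wake one waiting worker */
    pthread_mutex_unlock(&lock);
}

/* called by each worker (the SPE-side role) to grab the next task */
int pop_task(void) {
    int task;
    pthread_mutex_lock(&lock);
    while (count == 0)
        pthread_cond_wait(&nonempty, &lock);   /* sleep until work arrives */
    task = queue[head];
    head = (head + 1) % QSIZE;
    count--;
    pthread_mutex_unlock(&lock);
    return task;
}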
1035 00:46:59,040 --> 00:47:00,935 Just do the simple thing. 1036 00:47:00,935 --> 00:47:01,882 Just get the simple thing working. 1037 00:47:01,882 --> 00:47:05,005 First get the sequential code working, and keep adding more 1038 00:47:05,005 --> 00:47:05,650 and more to it. 1039 00:47:05,650 --> 00:47:08,430 And then make sure that at each level it works. 1040 00:47:08,430 --> 00:47:10,350 The problem with parallelism is that things are 1041 00:47:10,350 --> 00:47:12,020 nondeterministic, so some bugs might only show up sometimes. 1042 00:47:12,020 --> 00:47:13,950 That might be hard. 1043 00:47:13,950 --> 00:47:17,770 But design absolutely matters. 1044 00:47:17,770 --> 00:47:21,000 Another thing I think, especially for doing demos and 1045 00:47:21,000 --> 00:47:24,570 stuff, would be nice, would be to have a knob that basically 1046 00:47:24,570 --> 00:47:25,510 you can tune. 1047 00:47:25,510 --> 00:47:26,180 So you can say, OK, no SPEs, 1048 00:47:26,180 --> 00:47:29,624 everything running on the PPE; one SPE; two SPEs. 1049 00:47:29,624 --> 00:47:32,410 So you can actually see, hopefully, in your code how 1050 00:47:32,410 --> 00:47:36,693 things move for the demo part. 1051 00:47:36,693 --> 00:47:43,200 1052 00:47:43,200 --> 00:47:43,770 PROFESSOR: You had a question? 1053 00:47:43,770 --> 00:47:46,550 All right. 1054 00:47:46,550 --> 00:47:48,130 We'll take a brief break and do the 1055 00:47:48,130 --> 00:47:49,380 quizzes in the meantime. 1056 00:47:49,380 --> 00:48:50,863